library(tidyverse)
library(dplyr)
library(readr)
library(readxl)
library(httr)
library(jsonlite)
library(DataExplorer)
library(ggplot2)
library(GGally)
library(mice)
library(factoextra)
library(gridExtra)
library(cluster)
library(mclust)
library(plotly)
library(MASS)
library(leaflet)
#for reproducibility
set.seed(123)
Commercial airports in the United States vary in size, infrastructure, passenger volume, environmental conditions, and the types and frequency of delays they experience. These differences affect how airports operate and the challenges they face. Understanding these patterns is important for airport management, transportation planning, and infrastructure investment. However, there is no standard way to classify airports based on their characteristics.
This study uses Principal Component Analysis (PCA) and clustering to uncover patterns in airport data. It seeks to answer the following questions:
What key dimensions define variation among airports?
How do airports naturally cluster based on their characteristics?
These findings provide a data-driven framework for understanding airport differences. Airport authorities and policymakers can use this classification to assess infrastructure needs, allocate funding, and develop policies suited to different types of airports. Airlines can use these insights to refine scheduling strategies, optimize route planning, and anticipate operational challenges at different airport types. By identifying patterns in airport characteristics, this study offers a structured way to compare airports and support decisions that improve efficiency and resource management.
First I will load the OpenFlights data to get IATA codes. I will use these codes as a key when joining other datasets, and will use the latitude and longitude to match weather data.
#load OpenFlights data
airport_info <- read_csv("airports.dat", col_names = FALSE)
## Rows: 7698 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): X2, X3, X4, X5, X6, X10, X11, X12, X13, X14
## dbl (4): X1, X7, X8, X9
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#rename columns
colnames(airport_info) <- c("airport_id", "airport_name", "city", "country", "iata", "icao", "latitude", "longitude", "altitude", "timezone", "dst", "type", "source")
#filter for relevant columns
airport_info <- airport_info |>
dplyr::select(iata, airport_name, city, country, latitude, longitude) |>
dplyr::rename(airport_code = iata)
#filter for U.S. airports only
us_airports <- airport_info |> filter(country == "United States")
Next I will load BTS data, which contains information about all types of flight delays.
## Rows: 22621 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): carrier, carrier_name, airport, airport_name
## dbl (17): year, month, arr_flights, arr_del15, carrier_ct, weather_ct, nas_c...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Then I will load data about enplanements, or the number of passengers boarding an aircraft.
airport_size <- read_excel("cy23-commercial-service-enplanements.xlsx")
airport_size <- airport_size |>
dplyr::select("Locid", "CY 23 Enplanements") |>
dplyr::rename(enplanements = `CY 23 Enplanements`)
delay_size_us <- delay_data_us |>
left_join(airport_size, by = c("airport_code" = "Locid"))
Next I will load in data about airport runways, including total number, length, width, and elevation.
runways_data <- read_csv("Runways_IATA.csv")
## New names:
## • `` -> `...1`
## Rows: 46627 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): airport_ident, surface, le_ident, he_ident, iata_code
## dbl (17): ...1, id, airport_ref, length_ft, width_ft, lighted, closed, le_la...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
runways_data <- runways_data |>
filter(!is.na(iata_code)) |>
group_by(iata_code) |>
summarize(
num_runways = n(),
elevation_ft = first(le_elevation_ft), #Assumes consistent elevation at airport
total_runway_length = sum(length_ft, na.rm = TRUE),
total_runway_width = sum(width_ft, na.rm = TRUE)
)
#join data
delay_size_runways_us <- delay_size_us |>
left_join(runways_data, by = c("airport_code" = "iata_code"))
I will then load in data about the projected development budget for each airport over the 2021-2025 period.
dev_estimate <- read_excel("dev estimate.xlsx", skip = 1)
## New names:
## • `` -> `...9`
## • `` -> `...11`
dev_estimate <- dev_estimate |>
dplyr::select(Locid, '2021-2025\r\nDev Estimate') |>
dplyr::rename(dev_estimate = '2021-2025\r\nDev Estimate')
#join data
delay_size_runways_dev <- delay_size_runways_us |>
left_join(dev_estimate, by = c("airport_code" = "Locid"))
Finally I will create a function to get weather data from the Open-Meteo API (Note: you will only be able to make 100 calls before reaching the limit for the free tier of this service. Therefore, I’ve saved the full data in the attached weather_data.csv).
#get unique cities with their coordinates
unique_cities <- delay_size_runways_dev |>
dplyr::select(city, latitude, longitude) |>
distinct()
#function to pull weather data for each city
get_weather <- function(lat, lon, city) {
print(paste("Getting weather for:", city))
base_url <- "https://archive-api.open-meteo.com/v1/archive"
url <- sprintf("%s?latitude=%f&longitude=%f&start_date=2023-01-01&end_date=2023-12-31&daily=temperature_2m_max,temperature_2m_min,precipitation_sum,rain_sum,snowfall_sum&timezone=America/New_York",
base_url, lat, lon)
response <- GET(url)
Sys.sleep(0.5)
if (status_code(response) == 200) {
weather_data <- fromJSON(rawToChar(response$content))
result <- as.data.frame(weather_data$daily)
result$city <- city
return(result)
} else {
return(data.frame(
time = character(),
temperature_2m_max = numeric(),
temperature_2m_min = numeric(),
precipitation_sum = numeric(),
rain_sum = numeric(),
snowfall_sum = numeric(),
city = character()
))
}
}
#get weather for all cities
weather_data <- unique_cities %>%
rowwise() %>% #rowwise() so get_weather() runs once per row
do(get_weather(.$latitude, .$longitude, .$city)) %>%
ungroup()
#save the data
write.csv(weather_data, "weather_data_2023.csv", row.names = FALSE)
After loading in the weather data, I will summarize it by year. Then I will join the weather dataset with my other data. Now we have all of our variables!
weather_data <- read_csv("weather_data.csv")
## Rows: 127750 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): city
## dbl (5): temperature_2m_max, temperature_2m_min, precipitation_sum, rain_su...
## date (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weather_data <- weather_data |>
mutate(year = year(time)) |>
group_by(city, year) |>
summarize(
p90_max_temp = quantile(temperature_2m_max, 0.90, na.rm = TRUE),
p10_min_temp = quantile(temperature_2m_min, 0.10, na.rm = TRUE),
temp_range = max(temperature_2m_max, na.rm = TRUE) - min(temperature_2m_min, na.rm = TRUE),
total_rain = sum(rain_sum, na.rm = TRUE),
total_snow = sum(snowfall_sum, na.rm = TRUE),
.groups = "drop")
#join data
final_flights <- delay_size_runways_dev |>
left_join(weather_data, by = "city")
#I will use this later to create maps
airport_locations <- final_flights |>
dplyr::select(airport_code, latitude, longitude) |>
distinct()
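Since leaflet is already loaded, the saved locations can later be rendered as an interactive map. A minimal sketch of the idea (using a small stand-in table with approximate coordinates; the real airport_locations data frame would be passed instead):

```r
library(leaflet)

#illustrative stand-in for airport_locations (approximate coordinates)
demo_locations <- data.frame(
  airport_code = c("ORD", "LAX"),
  latitude = c(41.98, 33.94),
  longitude = c(-87.90, -118.41)
)

#one circle marker per airport, labeled with its IATA code
leaflet(demo_locations) |>
  addTiles() |>
  addCircleMarkers(~longitude, ~latitude, label = ~airport_code, radius = 4)
```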
#remove key columns that aren't useful for analysis
final_flights <- final_flights |>
dplyr::select(-year, -country, -airport_name, -latitude, -longitude, -city)
I can see that my dataset has nearly all continuous variables, which is ideal for PCA and clustering. The only non-numeric variable is airport_code, which I will keep for my graphs later on. I can also see that while most rows are complete, about 6% are missing some data.
plot_intro(final_flights)
#non-numeric variables
names(final_flights)[!sapply(final_flights, is.numeric)]
## [1] "airport_code"
Even though only 6% of rows have missing data, I’ll impute with MICE rather than drop them. Removing these observations would unnecessarily shrink my sample size and could bias the analysis. Now I have no missing observations in my dataset.
#inspect missing observations per row
missing_rows <- final_flights |>
filter(if_any(everything(), is.na))
missing_rows
## # A tibble: 21 × 18
## airport_code arr_flights weather_delay_rate carrier_delay_rate nas_delay_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ADK 104 0.337 5 0.519
## 2 AZA 5452 2.47 7.54 3.09
## 3 BQN 2898 1.53 12.2 4.21
## 4 BTM 674 0.671 2.04 0.0430
## 5 ECP 7583 0.496 4.29 1.77
## 6 FCA 4556 2.46 3.98 1.15
## 7 GPT 3852 0.888 3.60 1.68
## 8 GUM 757 0 2.91 0.535
## 9 HHH 1802 0.567 2.50 2.05
## 10 MQT 897 3.23 5.12 1.44
## # ℹ 11 more rows
## # ℹ 13 more variables: security_delay_rate <dbl>,
## # late_aircraft_delay_rate <dbl>, enplanements <dbl>, num_runways <int>,
## # elevation_ft <dbl>, total_runway_length <dbl>, total_runway_width <dbl>,
## # dev_estimate <dbl>, p90_max_temp <dbl>, p10_min_temp <dbl>,
## # temp_range <dbl>, total_rain <dbl>, total_snow <dbl>
#impute using MICE
m <- 4 #number of imputed datasets
mice_mod <- mice(final_flights, m=m, method='rf')
##
## iter imp variable
## 1 1 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 1 2 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 1 3 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 1 4 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 2 1 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 2 2 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 2 3 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 2 4 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 3 1 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 3 2 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 3 3 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 3 4 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 4 1 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 4 2 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 4 3 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 4 4 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 5 1 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 5 2 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 5 3 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## 5 4 enplanements elevation_ft dev_estimate p90_max_temp p10_min_temp temp_range total_rain total_snow
## Warning: Number of logged events: 1
final_flights <- complete(mice_mod, action=m)
#check that missing values are filled
sum(is.na(final_flights))
## [1] 0
Next I will check feature distributions to determine if any variables should be transformed. I observe that some variables, such as arr_flights, enplanements, and dev_estimate, are highly skewed. Therefore, I will apply a logarithmic transformation to these variables. Additionally, my features have different scales, so I will ensure all features are standardized with scale() when running PCA and clustering, so they contribute equally.
#plot histograms
df_long <- final_flights |>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "feature", values_to = "value")
ggplot(df_long, aes(x = value)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
facet_wrap(~feature, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Features")
#log transform skewed variables
final_flights <- final_flights |>
mutate(across(c(arr_flights, enplanements, dev_estimate,
total_snow, weather_delay_rate, total_runway_length), log1p))
#log1p(x) ensures zeros remain well-defined
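As a quick toy illustration (not the airport data), log1p compresses a heavy right tail while keeping zeros well-defined, and scale() puts columns on a common footing:

```r
#toy skewed vector: log1p pulls in the long right tail
x <- c(0, 10, 100, 1000, 100000)
round(log1p(x), 2)
#log1p(0) is exactly 0, so zero totals stay well-defined

#scale() centers each column to mean 0 and rescales to sd 1
m <- scale(cbind(a = c(1, 2, 3), b = c(10, 20, 30)))
round(colMeans(m), 10)  #both columns ~0
apply(m, 2, sd)         #both columns 1
```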
Next, I’ll examine the correlation between my variables. I observe strong correlations among a few variables (e.g., delay metrics, temperature metrics, and airport infrastructure measures), indicating that some features capture similar patterns in the data. These correlations suggest that PCA can effectively reduce dimensionality by combining highly related variables into principal components. However, since there are more weak than strong correlations, I may need more PCs to retain enough variance for clustering to perform well.
ggcorr(final_flights, label = T)
After removing non-numeric variables and scaling the data, I will run PCA. The results show that PC1 and PC2 capture just under 50% of the variance (31.8% and 16.1%), with the first 10 PCs explaining over 92%. This means I can use fewer dimensions without losing important patterns in the data.
#store airport codes separately
airport_codes <- final_flights$airport_code
#remove airport code for now, since we can only use numeric variables
flights_numeric <- final_flights |>
dplyr::select(-airport_code)
#use PCA
pca = prcomp(flights_numeric, scale = TRUE) #don't forget to scale!
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3240 1.6548 1.28695 1.25245 1.08414 0.96116 0.89022
## Proportion of Variance 0.3177 0.1611 0.09743 0.09227 0.06914 0.05434 0.04662
## Cumulative Proportion 0.3177 0.4788 0.57621 0.66848 0.73762 0.79197 0.83858
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.77978 0.69370 0.65554 0.63196 0.56158 0.39526 0.3846
## Proportion of Variance 0.03577 0.02831 0.02528 0.02349 0.01855 0.00919 0.0087
## Cumulative Proportion 0.87435 0.90266 0.92794 0.95143 0.96998 0.97917 0.9879
## PC15 PC16 PC17
## Standard deviation 0.32806 0.26989 0.16030
## Proportion of Variance 0.00633 0.00428 0.00151
## Cumulative Proportion 0.99420 0.99849 1.00000
The scree plot further illustrates that there’s a notable drop after the first component and again after the second, confirming these two dimensions are most influential. Components 3 and 4 still contribute meaningfully (9.7% and 9.2%), so I will focus on these four dimensions in my analysis to capture approximately 67% of the total variance.
fviz_eig(pca, addlabels = TRUE, ylim = c(0, 50))
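The same proportions can be recovered directly from any prcomp object without factoextra. A self-contained sketch, using the built-in USArrests data as a stand-in for flights_numeric:

```r
#stand-in example: explained variance computed from a prcomp fit
p <- prcomp(USArrests, scale = TRUE)

#each PC's variance is the squared standard deviation
prop_var <- p$sdev^2 / sum(p$sdev^2)
round(prop_var, 3)          #proportion of variance per PC
round(cumsum(prop_var), 3)  #cumulative proportion, as reported by summary()
```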
PC1 seems to capture airport size and traffic. Airports with lower PC1 scores tend to be larger, busier hubs with high passenger traffic, more arriving flights, and longer runways (ex. Chicago, Los Angeles, Dallas). In contrast, airports with higher PC1 scores are generally smaller, with fewer flights, shorter runways, and lower passenger volumes.
pca$rotation[,1]
## arr_flights weather_delay_rate carrier_delay_rate
## -0.35929600 0.20226231 0.02366465
## nas_delay_rate security_delay_rate late_aircraft_delay_rate
## -0.22174522 -0.05889220 -0.17825567
## enplanements num_runways elevation_ft
## -0.36866752 -0.26405314 0.18232794
## total_runway_length total_runway_width dev_estimate
## -0.29588589 -0.27938994 -0.29927851
## p90_max_temp p10_min_temp temp_range
## -0.14957849 -0.28120254 0.23721402
## total_rain total_snow
## -0.17311605 0.25154653
#visualization
fviz_contrib(pca, choice = "var", axes = 1)
#top 10 airports for PC1
airport_codes[order(pca$x[,1], decreasing = TRUE)][1:10]
## [1] "SCC" "WYS" "DLG" "EKO" "CYS" "RIW" "CNY" "BRW" "CIU" "DVL"
#bottom 10
airport_codes[order(pca$x[,1])][1:10]
## [1] "ORD" "DFW" "HNL" "IAH" "MIA" "MCO" "ATL" "SFO" "LAX" "CLT"
PC2 captures broad climate characteristics, primarily distinguishing warm, rainy, low-elevation locations from cold, snowy, high-altitude locations. Airports with high PC2 scores, such as those in Key West and Hawaii are located in tropical or coastal regions with consistently warm temperatures and frequent rainfall. In contrast, airports with low PC2 scores, like those in Denver, Minneapolis, and Fairbanks, are situated in colder climates where snowfall and seasonal temperature variation are common.
pca$rotation[,2]
## arr_flights weather_delay_rate carrier_delay_rate
## -0.12648112 -0.06583423 -0.04732765
## nas_delay_rate security_delay_rate late_aircraft_delay_rate
## 0.07740763 0.06048340 -0.02952242
## enplanements num_runways elevation_ft
## -0.09947544 -0.36408573 -0.28080236
## total_runway_length total_runway_width dev_estimate
## -0.35568551 -0.31756166 -0.17678120
## p90_max_temp p10_min_temp temp_range
## 0.12198930 0.38471786 -0.39987598
## total_rain total_snow
## 0.18563527 -0.36019898
#visualization
fviz_contrib(pca, choice = "var", axes = 2)
#top 10 airports for PC2
airport_codes[order(pca$x[,2], decreasing = TRUE)][1:10]
## [1] "EYW" "KOA" "HHH" "ITO" "DRT" "LIH" "PIB" "ECP" "BQK" "STX"
#bottom 10
airport_codes[order(pca$x[,2])][1:10]
## [1] "ORD" "DEN" "DTW" "DFW" "SLC" "BOS" "MSP" "CPR" "MKE" "FAI"
PC3 differentiates airports based on the primary causes of delays. Airports with higher PC3 scores usually experience delays caused by airline scheduling issues, air traffic control restrictions, or inefficiencies in airport turnaround operations rather than external environmental factors. In contrast, airports with lower PC3 scores may have more weather-independent, infrastructure-related disruptions.
pca$rotation[,3]
## arr_flights weather_delay_rate carrier_delay_rate
## -0.001396245 -0.284882521 -0.588426086
## nas_delay_rate security_delay_rate late_aircraft_delay_rate
## -0.433505132 -0.309657767 -0.458043729
## enplanements num_runways elevation_ft
## -0.026778900 0.097034557 -0.136976784
## total_runway_length total_runway_width dev_estimate
## 0.036471775 0.129316422 0.031318289
## p90_max_temp p10_min_temp temp_range
## -0.095976142 0.011219089 -0.079703615
## total_rain total_snow
## 0.123832127 0.004741778
#visualization
fviz_contrib(pca, choice = "var", axes = 3)
#top 10 airports for PC3
airport_codes[order(pca$x[,3], decreasing = TRUE)][1:10]
## [1] "GST" "KTN" "YKM" "PUB" "BTM" "ADK" "YAK" "EAT" "ITO" "SPN"
#bottom 10
airport_codes[order(pca$x[,3])][1:10]
## [1] "PVU" "BQN" "PSE" "SMX" "SFB" "COD" "OWB" "ASE" "PGD" "PSM"
PC4 also relates to weather, but it measures total precipitation, whether from heavy rainfall or snowfall. Unlike PC2, which distinguishes warm, humid regions from cold, snowy ones, PC4 groups tropical, rainy airports and cold, snowy airports together due to their high precipitation levels. In contrast, airports with low PC4 scores, such as those in Arizona, Texas, and New Mexico, are defined by their dry conditions, receiving little to no precipitation regardless of temperature.
pca$rotation[,4]
## arr_flights weather_delay_rate carrier_delay_rate
## 0.060111400 -0.015115273 -0.108672583
## nas_delay_rate security_delay_rate late_aircraft_delay_rate
## 0.238369017 0.155531192 0.123726431
## enplanements num_runways elevation_ft
## 0.119701618 -0.108087760 -0.318482114
## total_runway_length total_runway_width dev_estimate
## -0.141792022 -0.008670956 0.041398756
## p90_max_temp p10_min_temp temp_range
## -0.624625354 -0.187518356 -0.005587357
## total_rain total_snow
## 0.445500333 0.338928307
#visualization
fviz_contrib(pca, choice = "var", axes = 4)
#top 10 airports for PC4
airport_codes[order(pca$x[,4], decreasing = TRUE)][1:10]
## [1] "YAK" "CDV" "BRW" "SIT" "GUM" "ANC" "WRG" "JNU" "PSG" "KTN"
#bottom 10
airport_codes[order(pca$x[,4])][1:10]
## [1] "YUM" "PUB" "SPS" "DRT" "LRD" "ROW" "VCT" "MAF" "HOB" "TUS"
The PC2 vs. PC4 plot below shows how airports vary in climate (PC2) and precipitation levels (PC4). PC2 separates warm, low-elevation airports (right) from cold, high-altitude ones (left), while PC4 distinguishes wet (top) from dry (bottom) locations. Alaskan airports cluster at the top left, reflecting cold temperatures and extreme precipitation, while desert airports like Yuma and Laredo appear at the bottom-right, indicating hot, dry conditions. The scattered, oval-like shape of the data suggests that while climate and precipitation are related, they are not perfectly correlated.
#create a dataframe with PC scores and airport weather data
airport_pca <- data.frame(
PC2 = pca$x[,2],
PC4 = pca$x[,4],
airport_code = airport_codes,
max_temp = final_flights$p90_max_temp,
min_temp = final_flights$p10_min_temp,
total_rain = final_flights$total_rain,
total_snow = final_flights$total_snow
)
#define categories based on temp and precipitation
airport_pca <- airport_pca %>%
mutate(climate_category = case_when(
max_temp < 30.6 & (total_rain > 925.7 | total_snow > 3.6) ~ "Cold + Wet",
max_temp < 30.6 & (total_rain <= 925.7 & total_snow <= 3.6) ~ "Cold + Dry",
max_temp >= 30.6 & total_rain > 925.7 ~ "Hot + Wet",
max_temp >= 30.6 & total_rain <= 925.7 ~ "Hot + Dry"
))
#plot
ggplot(airport_pca, aes(x = PC2, y = PC4, label = airport_code, color = climate_category)) +
geom_text(size = 2.5, alpha = 0.8) +
geom_point(aes(x = 0, y = 0), size = 0) + #dummy points to generate legend correctly
scale_color_manual(values = c("Cold + Wet" = "blue",
"Cold + Dry" = "purple",
"Hot + Wet" = "green",
"Hot + Dry" = "red")) +
labs(title = "PC2 vs. PC4: Airport Climate and Precipitation",
x = "PC2 (Climate & Geography)",
y = "PC4 (Wet vs. Dry)",
color = "Climate Category") + # Correct legend title
theme_minimal() +
theme(legend.position = "right",
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10, face = "plain")) +
guides(color = guide_legend(override.aes = list(size = 4, shape = 16)))
My next step is to apply clustering techniques. Clustering is a fundamental tool in unsupervised learning that groups similar observations together based on certain characteristics. In this section, I will use two methods: K-means and hierarchical clustering, to explore the structure of my data and gain insights into distinct clusters of airports.
In clustering, selecting the appropriate number of clusters is one of the most important steps. Therefore, I will use three graphical methods as hints to find the optimal number. These results suggest anywhere between 2 and 6 clusters. To me, 2 seems too simple, while 6 seems like too many clusters to interpret. Therefore, I will test k = 3, 4, and 5, since these values provide a good balance between detailed groupings and interpretability.
#elbow method
fviz_nbclust(flights_numeric, kmeans, method = "wss", k.max = 10, nstart = 25) +
theme_minimal() + ggtitle("Elbow Method")
#silhouette method
fviz_nbclust(flights_numeric, kmeans, method = "silhouette", k.max = 10, nstart = 25) +
theme_minimal() + ggtitle("Silhouette Method")
#gap statistic
fviz_nbclust(flights_numeric, kmeans, method = "gap_stat", k.max = 10, nstart = 25, nboot=100) + theme_minimal() + ggtitle("Gap Statistic")
## Warning: did not converge in 10 iterations
I will start by using K-means to create 3 clusters:
#kmeans with 3 clusters
fit_km_3 <- kmeans(scale(flights_numeric), centers = 3, nstart = 1000) #remember to scale
fit_km_3
## K-means clustering with 3 clusters of sizes 158, 128, 73
##
## Cluster means:
## arr_flights weather_delay_rate carrier_delay_rate nas_delay_rate
## 1 -0.57006563 0.37205793 0.05258980 -0.4189479
## 2 -0.09700057 -0.07863901 -0.02305512 0.1706527
## 3 1.40392388 -0.66738848 -0.07339908 0.6075373
## security_delay_rate late_aircraft_delay_rate enplanements num_runways
## 1 -0.1426322 -0.252633822 -0.59835859 -0.2521719
## 2 0.0770433 0.004289876 -0.04983412 -0.3535042
## 3 0.1736212 0.539274517 1.38245787 1.1656399
## elevation_ft total_runway_length total_runway_width dev_estimate p90_max_temp
## 1 0.5131003 -0.2944627 -0.3062242 -0.4233261 -0.4968538
## 2 -0.4804279 -0.3805340 -0.3072958 -0.1890934 0.4677617
## 3 -0.2681518 1.3045680 1.2016065 1.2478011 0.2551973
## p10_min_temp temp_range total_rain total_snow
## 1 -0.7728586 0.7310078 -0.4203869 0.8543882
## 2 0.7141211 -0.6993542 0.2798469 -0.8576866
## 3 0.4206050 -0.3559165 0.4191881 -0.3453350
##
## Clustering vector:
## [1] 2 2 3 1 1 2 2 2 2 1 1 2 2 1 2 1 1 1 3 1 1 1 3 1 3 2 1 2 1 3 1 1 2 1 1 2 1
## [38] 1 1 1 2 2 1 3 3 3 2 2 1 1 2 1 1 2 1 1 2 3 1 2 2 1 1 2 2 2 1 1 2 3 2 3 3 1
## [75] 1 1 1 1 2 1 2 2 2 3 1 1 2 3 3 3 1 1 3 3 2 1 1 1 1 2 3 3 1 1 2 1 1 1 3 1 1
## [112] 3 2 2 3 2 1 1 2 2 1 1 3 2 1 1 1 2 1 1 1 1 1 2 1 2 2 1 1 2 1 3 2 1 1 2 1 1
## [149] 1 2 2 1 1 3 1 3 2 2 2 2 1 1 3 1 3 3 1 2 1 3 1 3 1 2 1 2 2 3 2 1 1 1 2 3 1
## [186] 1 3 2 3 2 2 1 1 2 2 2 2 3 3 2 3 1 2 1 1 2 3 1 3 3 1 2 3 2 3 2 1 2 1 1 3 3
## [223] 2 1 2 2 1 1 2 1 1 3 3 1 2 2 2 3 2 3 3 1 3 3 2 1 1 1 2 2 1 3 3 2 2 2 3 3 1
## [260] 2 3 1 3 1 2 2 1 1 2 2 1 1 2 1 1 2 1 1 1 2 1 3 1 1 3 1 1 1 2 1 1 1 2 1 2 3
## [297] 2 2 1 2 2 1 1 2 3 3 3 3 2 1 1 2 1 3 2 3 3 1 2 2 2 1 1 2 2 1 3 2 2 2 1 1 1
## [334] 2 1 2 1 3 2 2 3 3 1 1 2 2 2 2 2 1 2 2 1 1 2 1 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 1982.8119 1311.8761 939.9635
## (between_SS / total_SS = 30.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The clusters reveal three main types of airports: warm climate airports with minimal seasonal variation, high-altitude, snowy airports that face weather delays, and large airports with extensive flight operations and logistical delays. These clusters already seem well defined to me, but let’s experiment with a few more.
centers=fit_km_3$centers
k <- 3
par(mfrow = c(2, 3))
for (i in 1:k) {
colors <- ifelse(centers[i, ] > 0, "blue", "red")
barplot(centers[i, ],
main = paste("Cluster", i, "Center"),
las = 2,
col = colors,
ylim = c(min(centers), max(centers)))
}
par(mfrow = c(1, 1))
Next I will try 4 clusters:
#kmeans with 4 clusters
fit_km_4 <- kmeans(scale(flights_numeric), centers = 4, nstart = 1000) #remember to scale
fit_km_4
## K-means clustering with 4 clusters of sizes 86, 95, 104, 74
##
## Cluster means:
## arr_flights weather_delay_rate carrier_delay_rate nas_delay_rate
## 1 -0.63753206 0.5136642 0.10978267 -0.769854685
## 2 -0.49263663 0.2120744 0.18117424 0.173886059
## 3 -0.01478396 -0.1822898 -0.22417683 -0.005744789
## 4 1.39413202 -0.6130277 -0.04511448 0.679537640
## security_delay_rate late_aircraft_delay_rate enplanements num_runways
## 1 -0.23163183 -0.6628812 -0.75592317 -0.2872743
## 2 -0.01811143 0.3127817 -0.37838950 -0.2334441
## 3 0.06754386 -0.1396011 -0.01306182 -0.3807515
## 4 0.19751841 0.5650274 1.38263277 1.1686614
## elevation_ft total_runway_length total_runway_width dev_estimate p90_max_temp
## 1 1.2357333 -0.3422961 -0.4325623 -0.4399712 -0.1296358
## 2 -0.3716521 -0.2913890 -0.1330726 -0.4163104 -0.8253057
## 3 -0.4944957 -0.3778767 -0.3093666 -0.1583045 0.6323749
## 4 -0.2640345 1.3029541 1.1083295 1.2682524 0.3214287
## p10_min_temp temp_range total_rain total_snow
## 1 -0.9032883 0.9729126 -0.9368287 0.7758260
## 2 -0.4478841 0.2812126 0.3379386 0.7750090
## 3 0.8362024 -0.8040338 0.2096475 -1.0669088
## 4 0.4495506 -0.3617050 0.3602670 -0.3971403
##
## Clustering vector:
## [1] 2 3 4 1 2 2 3 2 3 2 2 3 3 2 2 2 1 1 4 1 2 1 4 2 4 3 2 4 2 4 2 1 3 2 2 3 1
## [38] 1 1 1 2 2 2 4 4 4 3 3 2 1 3 2 1 3 2 2 3 4 1 3 2 1 2 3 3 3 2 1 2 4 3 4 4 2
## [75] 2 1 1 1 3 1 3 3 2 4 2 1 3 4 4 4 1 2 4 4 3 1 2 1 1 3 4 4 1 1 3 1 1 1 4 2 1
## [112] 4 3 3 4 3 2 1 3 3 1 1 4 3 2 1 2 3 2 1 1 1 1 3 1 3 3 2 1 3 2 4 3 2 1 3 1 2
## [149] 1 2 3 1 1 4 1 4 2 3 3 3 2 1 4 2 4 4 1 3 1 4 1 4 2 3 1 3 3 4 2 1 2 2 3 2 2
## [186] 1 4 3 4 3 2 1 1 3 2 3 3 4 4 3 4 2 3 2 1 3 4 2 4 4 1 3 4 3 4 3 1 3 1 2 4 4
## [223] 3 2 3 3 1 2 3 2 1 4 4 1 2 3 3 4 3 4 4 2 4 4 3 2 2 2 2 2 2 4 4 4 3 3 4 4 2
## [260] 3 4 1 4 1 3 3 2 1 3 2 2 2 3 1 1 2 2 2 1 3 1 4 2 1 4 1 1 1 3 2 1 2 3 1 3 4
## [297] 3 3 2 3 3 1 2 3 4 4 4 4 2 1 1 3 2 4 3 4 4 2 3 2 3 2 2 3 3 1 4 3 3 3 1 1 2
## [334] 3 2 3 2 4 3 2 4 4 2 1 3 3 3 3 3 1 3 3 2 1 3 1 2 1 3
##
## Within cluster sum of squares by cluster:
## [1] 848.4995 1298.1380 846.4786 917.3262
## (between_SS / total_SS = 35.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
In addition to the three clusters identified above, adding a fourth cluster introduces average-performance airports with no extreme metrics across any dimension.
centers=fit_km_4$centers
k <- 4
par(mfrow = c(2, 3))
for (i in 1:k) {
colors <- ifelse(centers[i, ] > 0, "blue", "red")
barplot(centers[i, ],
main = paste("Cluster", i, "Center"),
las = 2,
col = colors,
ylim = c(min(centers), max(centers)))
}
par(mfrow = c(1, 1))
Finally I will try 5 clusters:
#kmeans with 5 clusters
fit_km_5 <- kmeans(scale(flights_numeric), centers = 5, nstart = 1000) #remember to scale
fit_km_5
## K-means clustering with 5 clusters of sizes 63, 91, 19, 101, 85
##
## Cluster means:
## arr_flights weather_delay_rate carrier_delay_rate nas_delay_rate
## 1 1.50402600 -0.6752126 -0.08554026 0.6333431
## 2 -0.38800958 0.1192294 -0.14606783 0.0105287
## 3 -0.38856162 0.5504262 1.35563832 1.8983855
## 4 0.05967278 -0.2652487 -0.23497730 -0.1007363
## 5 -0.68340054 0.5649475 0.19596222 -0.7853374
## security_delay_rate late_aircraft_delay_rate enplanements num_runways
## 1 0.02824476 0.5323328 1.47144195 1.3172644
## 2 -0.24545493 0.1937804 -0.30555207 -0.2051233
## 3 2.59658778 0.7144830 -0.02399725 -0.5675244
## 4 -0.08289136 -0.1018617 0.04478250 -0.3166098
## 5 -0.24007249 -0.6406838 -0.81132576 -0.2536574
## elevation_ft total_runway_length total_runway_width dev_estimate p90_max_temp
## 1 -0.2747139 1.4044523 1.25932497 1.3431802 0.25254788
## 2 -0.3841187 -0.2544146 -0.09220965 -0.3280913 -0.81192527
## 3 -0.3902538 -0.4995992 -0.47034083 -0.2262770 -0.02804604
## 4 -0.4465577 -0.2667985 -0.25070169 -0.0909986 0.70536984
## 5 1.2326933 -0.3398793 -0.43163587 -0.4855756 -0.14982113
## p10_min_temp temp_range total_rain total_snow
## 1 0.4061053 -0.3456692 0.46435822 -0.3673660
## 2 -0.4571650 0.3041411 0.36813485 0.7926992
## 3 0.3583323 -0.4691589 0.09902561 -0.3005728
## 4 0.8608186 -0.7919514 0.14139480 -1.0860830
## 5 -0.9145131 0.9764875 -0.92843766 0.7813375
##
## Clustering vector:
## [1] 3 4 1 5 2 2 3 2 3 2 2 4 4 2 2 2 5 4 3 5 2 5 1 2 4 3 2 4 2 1 2 5 4 2 2 4 5
## [38] 5 5 5 2 2 2 1 1 1 3 4 3 5 4 2 5 4 2 2 4 1 5 4 2 5 2 4 4 4 2 5 2 1 4 1 1 2
## [75] 2 5 5 5 2 5 4 4 2 1 2 5 4 1 1 1 5 2 1 1 4 5 2 5 5 4 2 1 5 5 4 5 5 5 1 2 5
## [112] 2 2 4 1 4 2 5 4 4 5 5 1 4 2 5 2 4 2 5 5 2 5 4 5 4 4 2 5 4 2 1 4 2 5 4 5 2
## [149] 5 2 4 5 5 1 5 1 2 4 4 4 2 5 1 2 1 1 5 4 5 1 5 1 2 4 5 4 4 1 2 5 2 5 4 2 2
## [186] 5 1 4 1 4 3 5 5 4 2 4 4 1 1 4 1 2 4 2 5 4 4 2 1 1 5 4 1 4 1 4 2 4 5 2 1 1
## [223] 4 2 4 4 5 2 4 2 5 1 1 5 2 4 4 1 4 1 1 2 4 1 4 2 2 2 3 2 2 1 1 3 4 4 1 1 2
## [260] 4 1 5 1 5 4 4 2 5 4 3 2 3 4 5 5 2 3 2 5 4 5 1 2 5 1 5 5 5 4 2 5 2 4 5 4 1
## [297] 4 4 2 4 4 5 2 4 1 1 3 1 2 5 5 4 2 1 4 3 1 5 4 3 4 2 2 4 4 5 1 4 3 3 5 5 2
## [334] 4 2 4 2 1 4 2 1 4 2 5 4 4 4 3 4 5 4 4 2 5 4 5 2 5 4
##
## Within cluster sum of squares by cluster:
## [1] 692.0667 924.7528 424.5258 773.5134 873.6018
## (between_SS / total_SS = 39.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The five-cluster solution introduces a delay-prone cluster with systemic issues across all delay metrics. The other clusters remain largely the same as those identified in the four-cluster solution.
#extract cluster centers from the five-cluster fit
centers <- fit_km_5$centers
k <- 5
par(mfrow = c(2, 3))
for (i in 1:k) {
colors <- ifelse(centers[i, ] > 0, "blue", "red")
barplot(centers[i, ],
main = paste("Cluster", i, "Center"),
las = 2,
col = colors,
ylim = c(min(centers), max(centers)))
}
par(mfrow = c(1, 1))
While the additional clusters are interesting, they become increasingly difficult to interpret, as clusters with only small deviations from zero begin to blend into one another. I would therefore select just 3 clusters for simplicity of interpretation. However, airport authorities or airlines conducting more granular operational analysis might prefer additional clusters to capture subtle variations. The balance of granularity versus interpretability always depends on the researcher and their objectives.
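As a quick cross-check of this choice of k, the elbow heuristic can be sketched as below. This is a self-contained toy example on synthetic data; in the actual analysis, the scaled airport feature matrix would replace `toy`.

```r
#elbow heuristic sketch on synthetic data; in the actual analysis the
#scaled airport feature matrix would be used instead of `toy`
set.seed(123)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))
#total within-cluster sum of squares for k = 1..7
wss <- sapply(1:7, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
#the "elbow" (here at k = 3) marks diminishing returns from adding clusters
plot(1:7, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS", main = "Elbow Method")
```

The same plot can be produced in one call with `fviz_nbclust(toy, kmeans, method = "wss")` from factoextra, which is already loaded.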
I created two graphs to visualize both the statistical clustering and geographic distribution of airport types across the United States. The PCA plot (first image) demonstrates how the airports separate into three distinct groups based on their operational characteristics, while the map (second image) reveals clear regional patterns in these clusters. Together, they show how climate, geography, and infrastructure requirements create natural groupings of airports, with warm-weather facilities dominating the South, high-volume hubs concentrated on the coasts and major cities, and snowy, high-altitude airports clustered in northern and mountainous regions.
#convert PCA results into a data frame
pca_df <- as.data.frame(pca$x)
#convert k-means cluster assignments to a factor for plotting
cluster_kmeans <- as.factor(fit_km_3$cluster)
#create visualization using PC1 and PC2
fviz_cluster(list(data = pca_df[, 1:2], cluster = cluster_kmeans),
geom = "point",
ellipse.type = "norm",
pointsize = 1) +
theme_minimal() +
geom_text(aes(label = airport_codes), hjust = 0, vjust = 0, size = 2, check_overlap = TRUE) +
scale_fill_brewer(palette = "Paired")
#restore latitude and longitude
mapping_flights <- final_flights |>
left_join(airport_locations, by = "airport_code")
#create a clustered data frame with airport codes
clustered_data <- pca_df |>
mutate(cluster = cluster_kmeans, # Add cluster assignments
airport_code = mapping_flights$airport_code) # Reattach airport codes
#merge clusters into mapping data
mapping_flights <- mapping_flights |>
left_join(clustered_data |> dplyr::select(airport_code, cluster) |> distinct(),
by = "airport_code")
#filter for U.S. airports
us_airports <- mapping_flights |>
filter(longitude > -130, longitude < -60, latitude > 20, latitude < 55)
#create a color palette
pal <- colorFactor(palette = c("orange", "green", "blue"), domain = us_airports$cluster)
#create a map
leaflet(us_airports) |>
addProviderTiles("CartoDB.Positron") |>
addCircleMarkers(
lng = ~longitude, lat = ~latitude,
color = ~pal(cluster),
radius = 5,
label = ~airport_code,
fillOpacity = 0.8
) |>
addLegend(
"bottomright", pal = pal, values = ~cluster,
title = "K-Means Clusters",
opacity = 1
) |>
setView(lng = -98, lat = 39.5, zoom = 4) |>
fitBounds(
lng1 = -125, lat1 = 25,
lng2 = -65, lat2 = 50
)
Next I will focus on hierarchical clustering, another technique for grouping similar observations. This method organizes the data into a nested, tree-like structure, allowing us to explore potential hierarchical relationships within it.
When using hierarchical clustering, it is essential to test different distance metrics and linkage methods to find the best clustering approach. I will try Euclidean and Manhattan distances since they are standard and reliable for numeric data. For linkage, I will try complete, average, and Ward’s.
#scale data
flights_scaled <- scale(flights_numeric)
#compute euclidean distance
d_euclidean <- dist(flights_scaled, method = "euclidean")
d_manhattan <- dist(flights_scaled, method = "manhattan")
#apply clustering using different linkage methods
#euclidean distance
hc_euclidean_complete <- hclust(d_euclidean, method = "complete")
hc_euclidean_average <- hclust(d_euclidean, method = "average")
hc_euclidean_ward <- hclust(d_euclidean, method = "ward.D2")
#manhattan distance
hc_manhattan_complete <- hclust(d_manhattan, method = "complete")
hc_manhattan_average <- hclust(d_manhattan, method = "average")
hc_manhattan_ward <- hclust(d_manhattan, method = "ward.D2")
#TEMPORARILY set k=5, just to color the dendrograms
k <- 5 #adjust as needed
#create dendrogram plots
d1 <- fviz_dend(hc_euclidean_complete, k = k, rect = TRUE, rect_fill = TRUE, show_labels = FALSE, main = "Euclidean + Complete")
d2 <- fviz_dend(hc_euclidean_average, k = k, rect = TRUE, rect_fill = TRUE, show_labels = FALSE, main = "Euclidean + Average")
d3 <- fviz_dend(hc_euclidean_ward, k = k, rect = TRUE, rect_fill = TRUE, show_labels = FALSE, main = "Euclidean + Ward")
d4 <- fviz_dend(hc_manhattan_complete, k = k, rect = TRUE, rect_fill = TRUE, show_labels = FALSE, main = "Manhattan + Complete")
d5 <- fviz_dend(hc_manhattan_average, k = k, rect = TRUE, rect_fill = TRUE, show_labels = FALSE, main = "Manhattan + Average")
d6 <- fviz_dend(hc_manhattan_ward, k = k, rect = TRUE, rect_fill = TRUE, show_labels = FALSE, main = "Manhattan + Ward")
#arrange plots
grid.arrange(d1, d2, d3, d4, d5, d6, ncol = 3)
Based on the dendrogram plots above, Euclidean + Ward and Manhattan + Ward produce the clearest results. I will compare these two combinations by how well their clusters are separated. Manhattan distance with Ward’s method performs better: it creates three clusters with notably higher silhouette scores (0.18, 0.25, and 0.22, versus 0.12, 0.16, and 0.20 for Euclidean), indicating stronger, more distinct groupings. Although one cluster has a poor score of 0.00, suggesting points that sit right between clusters, the superior quality of the other groupings outweighs this weakness in terms of overall clustering effectiveness.
#compute silhouette scores for best linkage (Ward’s)
sil_euclidean_ward <- silhouette(cutree(hc_euclidean_ward, k), d_euclidean)
sil_manhattan_ward <- silhouette(cutree(hc_manhattan_ward, k), d_manhattan)
#visualize silhouette scores
s1 <- fviz_silhouette(sil_euclidean_ward) + ggtitle("Euclidean + Ward")
## cluster size ave.sil.width
## 1 1 89 0.07
## 2 2 107 0.03
## 3 3 52 0.12
## 4 4 76 0.16
## 5 5 35 0.20
s2 <- fviz_silhouette(sil_manhattan_ward) + ggtitle("Manhattan + Ward")
## cluster size ave.sil.width
## 1 1 77 0.07
## 2 2 101 0.00
## 3 3 88 0.18
## 4 4 58 0.25
## 5 5 35 0.22
#arrange plots
grid.arrange(s1, s2, ncol = 2)
Next I will test for the best number of clusters. The average silhouette width is largest at k = 3, so this is likely the best number of clusters.
#test average silhouette
for (k_test in 3:5) {
clusters_hc <- cutree(hc_manhattan_ward, k = k_test)
sil <- silhouette(clusters_hc, d_manhattan)
avg_sil <- mean(sil[, "sil_width"])
cat("k =", k_test, " --> Avg Silhouette =", round(avg_sil, 3), "\n")
}
## k = 3 --> Avg Silhouette = 0.156
## k = 4 --> Avg Silhouette = 0.123
## k = 5 --> Avg Silhouette = 0.121
Here is the final dendrogram with 3 clusters, using Manhattan distance and Ward linkage.
#use final k=3
final_clusters_hc <- cutree(hc_manhattan_ward, k = 3)
#visualize final dendrogram with k=3
fviz_dend(hc_manhattan_ward,
k = 3,
rect = TRUE, rect_fill = TRUE,
main = "Final Hierarchical Clustering (Manhattan + Ward, k=3)")
I chose to visualize my hierarchical clustering analysis using three complementary graphs to reveal the natural groupings in airport characteristics across the United States. I first created a phylogenetic tree that displays the relationships between airports, showing how they branch into three distinct clusters based on their operational similarities. It is also much clearer to read than the dendrogram above.
#convert PCA results into a data frame
pca_df <- as.data.frame(pca$x)
#assign airport codes as row names
if (length(airport_codes) == nrow(pca_df)) {
rownames(pca_df) <- airport_codes
} else {
stop("Length of airport_codes does not match the number of rows in pca_df")
}
#ensure hierarchical clustering object retains airport codes as labels
hc_manhattan_ward$labels <- rownames(pca_df)
#phylogenetic tree
fviz_dend(x = hc_manhattan_ward,
k = 3, # 3 clusters
color_labels_by_k = TRUE,
cex = 0.8,
type = "phylogenic",
repel = TRUE,
show_labels = TRUE) +
labs(title = "Phylogenetic Tree of Airports") +
theme(axis.text.x = element_text(size = 6), axis.text.y = element_blank())
Next, I used the same two graph types as for K-means in order to visually compare the two methods. At first glance, the clusters look quite similar in terms of geographic distribution and overall patterns, but I will explore this further in the next section.
#cluster plot
fviz_cluster(list(data = pca_df[, 1:2], cluster = final_clusters_hc),
geom = "point",
ellipse.type = "norm",
pointsize = 1) +
theme_minimal() +
geom_text(aes(label = airport_codes), hjust = 0, vjust = 0, size = 2, check_overlap = TRUE) +
scale_fill_brewer(palette = "Paired")
#create a new column with remapped hierarchical clusters to match k-means numbering
clustered_data <- pca_df |>
mutate(
hcluster = final_clusters_hc,
# Remap cluster numbers to match K-means cluster numbering
hcluster_remapped = case_when(
final_clusters_hc == 1 ~ 1,
final_clusters_hc == 2 ~ 3,
final_clusters_hc == 3 ~ 2
),
airport_code = mapping_flights$airport_code
)
#merge clusters into mapping data
mapping_flights <- mapping_flights |>
left_join(clustered_data |> dplyr::select(airport_code, hcluster_remapped) |> distinct(),
by = "airport_code")
#filter for U.S. airports
us_airports <- mapping_flights |>
filter(longitude > -130, longitude < -60, latitude > 20, latitude < 55)
#create a color palette for clusters - use the SAME colors as in your K-means visualization
pal <- colorFactor(palette = c("orange", "green", "blue"), domain = c(1, 2, 3))
#create a map
leaflet(us_airports) |>
addProviderTiles("CartoDB.Positron") |>
addCircleMarkers(
lng = ~longitude, lat = ~latitude,
color = ~pal(hcluster_remapped),
radius = 5,
label = ~airport_code,
fillOpacity = 0.8
) |>
addLegend(
"bottomright", pal = pal, values = ~hcluster_remapped,
title = "Hierarchical Clusters",
opacity = 1
) |>
setView(lng = -98, lat = 39.5, zoom = 4) |>
fitBounds(
lng1 = -125, lat1 = 25,
lng2 = -65, lat2 = 50
)
First, let’s see how K-Means and Hierarchical Clustering compare in their assignments. K-Means Cluster 1 mostly overlaps with Hierarchical Cluster 1 (155 points), while K-Means Cluster 2 is split between Hierarchical Clusters 1 and 2. K-Means Cluster 3 is the most dispersed but primarily aligns with Hierarchical Cluster 3.
table(fit_km_3$cluster, final_clusters_hc)
## final_clusters_hc
## 1 2 3
## 1 155 3 0
## 2 73 55 0
## 3 8 30 35
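The agreement visible in this cross-tabulation can also be summarized in a single number. Below is a minimal base-R sketch of the (unadjusted) Rand index — the proportion of point pairs on which two partitions agree; applied to the real assignments it would be called as `rand_index(fit_km_3$cluster, final_clusters_hc)`. A chance-corrected alternative, `adjustedRandIndex()`, is available in the mclust package loaded above.

```r
#Rand index: proportion of point pairs that two clusterings treat the
#same way (both together or both apart); label permutations don't matter
rand_index <- function(a, b) {
  n <- length(a)
  same_a <- outer(a, a, "==")   #TRUE where a pair shares a cluster in `a`
  same_b <- outer(b, b, "==")   #TRUE where a pair shares a cluster in `b`
  agree <- same_a == same_b     #pairs treated identically by both partitions
  sum(agree[upper.tri(agree)]) / choose(n, 2)
}
#identical partitions (up to relabeling) agree on every pair
rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))  #returns 1
```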
Now, let’s look into why these distributions differ. Overall, hierarchical clustering follows the same general pattern as the earlier K-means clusters (snowy high-altitude airports, warm and temperate airports, large airports). However, K-Means creates more distinct and extreme feature values in each cluster, emphasizing stronger separation, whereas Hierarchical Clustering preserves more gradual variations, leading to less extreme but more structured clusters.
#scale numeric features and remove cluster labels
flights_scaled <- scale(flights_numeric)
flights_scaled_clean <- flights_scaled[, !colnames(flights_scaled) %in% c("kmeans_cluster", "hierarchical_cluster")]
#K-Means clustering
set.seed(123)
k <- 3
fit_km_3 <- kmeans(flights_scaled_clean, centers = k, nstart = 1000)
kmeans_centers <- fit_km_3$centers
#hierarchical clustering
hc_manhattan_ward <- hclust(dist(flights_scaled_clean, method = "manhattan"), method = "ward.D2")
final_clusters_hc <- cutree(hc_manhattan_ward, k = k)
#compute mean feature values for hierarchical clusters
hc_centers <- aggregate(flights_scaled_clean, by = list(final_clusters_hc), mean)[, -1]
hc_centers <- as.matrix(hc_centers)
#create bar plots for K-Means and hierarchical clusters
par(mfrow = c(2, k))
for (i in 1:k) {
barplot(as.numeric(kmeans_centers[i, ]),
main = paste("K-Means Cluster", i),
las = 2, col = ifelse(kmeans_centers[i, ] > 0, "blue", "red"),
ylim = range(kmeans_centers),
names.arg = colnames(kmeans_centers), cex.names = 0.7)
}
for (i in 1:k) {
barplot(as.numeric(hc_centers[i, ]),
main = paste("Hierarchical Cluster", i),
las = 2, col = ifelse(hc_centers[i, ] > 0, "blue", "red"),
ylim = range(hc_centers),
names.arg = colnames(hc_centers), cex.names = 0.7)
}
par(mfrow = c(1, 1))
Overall, I prefer K-Means because the well-separated clusters are easier to interpret and compare across groups. However, Hierarchical Clustering may be better if the goal is to capture underlying relationships and gradual transitions between data points. Below is a 3D graph of the K-Means clusters, which shows how effectively K-Means differentiates airports, with minimal overlap between groups making patterns in the data more apparent.
#convert PCA results to a data frame
pca_df <- as.data.frame(pca$x[, 1:3])
#add cluster labels and airport names
pca_df$Cluster <- as.factor(fit_km_3$cluster)
pca_df$Airport <- airport_codes
#function to create ellipsoid for each cluster
create_ellipsoid <- function(data, cluster_id, level = 0.95) {
cluster_data <- data[data$Cluster == cluster_id, c("PC1", "PC2", "PC3")]
mean_values <- colMeans(cluster_data)
cov_matrix <- cov(cluster_data)
theta <- seq(0, 2*pi, length.out = 25)
phi <- seq(0, pi, length.out = 25)
x_unit <- outer(cos(theta), sin(phi))
y_unit <- outer(sin(theta), sin(phi))
z_unit <- outer(rep(1, length(theta)), cos(phi))
eigen_decomp <- eigen(cov_matrix)
scale_factor <- sqrt(qchisq(level, df = 3))
radius <- scale_factor * sqrt(eigen_decomp$values)
x_ellipsoid <- matrix(0, nrow = length(theta), ncol = length(phi))
y_ellipsoid <- matrix(0, nrow = length(theta), ncol = length(phi))
z_ellipsoid <- matrix(0, nrow = length(theta), ncol = length(phi))
for (i in 1:length(theta)) {
for (j in 1:length(phi)) {
vec <- c(x_unit[i,j], y_unit[i,j], z_unit[i,j])
transformed <- eigen_decomp$vectors %*% diag(radius) %*% vec + mean_values
x_ellipsoid[i,j] <- transformed[1]
y_ellipsoid[i,j] <- transformed[2]
z_ellipsoid[i,j] <- transformed[3]
}
}
return(list(x = x_ellipsoid, y = y_ellipsoid, z = z_ellipsoid))
}
#get unique clusters
clusters <- unique(pca_df$Cluster)
#create a consistent color palette
cluster_colors <- c(
"#1f77b4",
"#ff7f0e",
"#2ca02c"
)
#create base plot
plot_3D <- plot_ly() %>%
layout(title = "3D Cluster Plot",
scene = list(xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")))
#add labels and ellipsoids
for (i in seq_along(clusters)) {
cluster_id <- clusters[i]
color <- cluster_colors[i]
#get data for this cluster
cluster_data <- pca_df[pca_df$Cluster == cluster_id, ]
#add text markers
plot_3D <- plot_3D %>% add_trace(
data = cluster_data,
x = ~PC1,
y = ~PC2,
z = ~PC3,
type = "scatter3d",
mode = "text",
text = ~Airport,
textfont = list(
size = 10,
color = color
),
name = paste("Cluster", cluster_id),
hoverinfo = "text",
showlegend = TRUE
)
#create and add ellipsoid surface for this cluster
ellipsoid <- create_ellipsoid(pca_df, cluster_id, level = 0.80)
#add ellipsoid with consistent color and transparency
plot_3D <- plot_3D %>% add_trace(
x = ellipsoid$x,
y = ellipsoid$y,
z = ellipsoid$z,
type = "surface",
opacity = 0.15,
colorscale = list(c(0, 1), c(color, color)),
showscale = FALSE,
name = paste("Ellipsoid", cluster_id),
hoverinfo = "none",
showlegend = FALSE
)
}
plot_3D
This unsupervised learning analysis identified distinct patterns and natural groupings among U.S. airports using PCA and clustering techniques. PCA extracted several principal components, with the first four capturing 66% of total variance. While additional components contribute to the remaining variation, this study focused on the first four as they represent the most interpretable patterns:
Airport size and traffic volume (PC1): Differentiates large hub airports with high passenger traffic, long runways, and frequent congestion from smaller regional airports.
Climate and geography (PC2): Separates warm airports from colder, high-altitude locations, reflecting how geography influences operations.
Delay cause patterns (PC3): Identifies whether delays stem from operational constraints like air traffic control or external environmental factors.
Precipitation levels (PC4): Captures extreme rainfall or snowfall as a distinct factor affecting airport conditions.
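These interpretations are read off the PCA loading matrix. As a minimal sketch of how such labels are justified — using a synthetic stand-in for the real feature matrix and `pca` object — the dominant loadings of each component can be inspected directly:

```r
#sketch: reading component interpretations from prcomp loadings, on a
#synthetic stand-in for the scaled airport feature matrix
set.seed(123)
X <- matrix(rnorm(200 * 5), ncol = 5,
            dimnames = list(NULL, c("enplanements", "total_runway_length",
                                    "p90_max_temp", "total_snow", "total_rain")))
pca_toy <- prcomp(X, scale. = TRUE)
#cumulative variance explained by the first four components
summary(pca_toy)$importance["Cumulative Proportion", 1:4]
#features with the largest absolute loadings drive each component's meaning
round(sort(abs(pca_toy$rotation[, "PC1"]), decreasing = TRUE), 2)
```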
The clustering analysis using K-Means and Hierarchical Clustering further refines these insights by grouping airports into three distinct categories:
Snowy, high-altitude airports: Airports at higher elevations experiencing significant weather delays, wide temperature ranges, and substantial snowfall.
Large, high-volume airports: Busy airports with high flight volumes, large budgets, and extensive infrastructure.
Warm-weather airports: Airports characterized by high temperatures year-round with little temperature fluctuation or snowfall.
While both methods produced similar clusters, K-Means created more clearly separated groups, making distinctions between airport types more interpretable. Hierarchical Clustering, in contrast, preserved gradual variations, reflecting underlying relationships between features.
These findings offer a data-driven framework for decision-making in airport management, policy, and airline operations. Airport authorities can use this classification to optimize infrastructure investments and air traffic management based on airport type, while airlines can adjust flight scheduling and route planning for better operational efficiency. Transportation planners can leverage these insights to improve regional air connectivity and network performance.
While this study provides a foundation for understanding the structure of U.S. airports, future research could explore additional principal components, track how airport clusters evolve over time, and extend the analysis to international airport systems to examine regulatory and infrastructure differences.
Bureau of Transportation Statistics. (n.d.). On-time performance and delay causes. U.S. Department of Transportation. Retrieved from https://www.transtats.bts.gov/ot_delay/OT_DelayCause1.asp?20=E
Federal Aviation Administration. (n.d.-a). Passenger boarding (enplanement) and all-cargo data. U.S. Department of Transportation. Retrieved from https://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger
Federal Aviation Administration. (n.d.-b). National plan of integrated airport systems (NPIAS) – Historical data. U.S. Department of Transportation. Retrieved from https://www.faa.gov/airports/planning_capacity/npias/current/historical
OpenFlights. (n.d.). Global airport database. Retrieved from https://github.com/jpatokal/openflights/blob/master/data/airports.dat
Open-Meteo. (n.d.). Weather data API. Retrieved from https://open-meteo.com/
OurAirports. (n.d.). Worldwide airport and aviation data. Retrieved from https://ourairports.com/data/